This week’s tidytuesday data comes from Pro Football Reference and includes attendance, standings, and game stats for each game. Well do a quick EDA and generate a few ideas of what might be interesting to look at.
# Import the data from tidytuesday: https://github.com/rfordatascience/tidytuesday
attendance <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv')
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
games <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/games.csv')First, lets take a look at the data and see what it can tell us at face value.
| team | team_name | year | total | home | away | week | weekly_attendance |
|---|---|---|---|---|---|---|---|
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 1 | 77434 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 2 | 66009 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 3 | NA |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 4 | 71801 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 5 | 66985 |
| Arizona | Cardinals | 2000 | 893926 | 387475 | 506451 | 6 | 44296 |
| Name | Piped data |
| Number of rows | 10846 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2009.53 | 5.75 | 2000 | 2005.0 | 2010 | 2015.00 | 2019 | ▇▇▇▇▇ |
| total | 0 | 1.00 | 1080910.03 | 72876.97 | 760644 | 1040509.0 | 1081090 | 1123230.00 | 1322087 | ▁▁▇▆▁ |
| home | 0 | 1.00 | 540455.01 | 66774.65 | 202687 | 504360.0 | 543185 | 578342.00 | 741775 | ▁▁▅▇▁ |
| away | 0 | 1.00 | 540455.01 | 25509.33 | 450295 | 524974.0 | 541757 | 557741.00 | 601655 | ▁▂▇▇▂ |
| week | 0 | 1.00 | 9.00 | 4.90 | 1 | 5.0 | 9 | 13.00 | 17 | ▇▆▆▆▇ |
| weekly_attendance | 638 | 0.94 | 67556.88 | 9022.02 | 23127 | 63245.5 | 68334 | 72544.75 | 105121 | ▁▁▇▃▁ |
We certainly have some interesting statistics here to mess with. It looks like the ‘total,’ ‘home,’ and ‘away’ attendance columns represent the entire season, while the ‘weekly_attendance’ field is for each week. Therefore, it looks like each row is a game. So there should be some variability in how many rows there are amongst the teams, as some make it to the playoff and some do not (therefore they’d have more games). A look at the summary statistics should tell us more… if that is the case or not.
So, as was expected, there does seem to be some variability associated with how many games they play. Not to worry, I’ll probably stay away from analyzing just that, and focus on either more aggregated descriptives or simply other statistics herein.
| team | team_name | year | wins | loss | points_for | points_against | points_differential | margin_of_victory | strength_of_schedule | simple_rating | offensive_ranking | defensive_ranking | playoffs | sb_winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Miami | Dolphins | 2000 | 11 | 5 | 323 | 226 | 97 | 6.1 | 1.0 | 7.1 | 0.0 | 7.1 | Playoffs | No Superbowl |
| Indianapolis | Colts | 2000 | 10 | 6 | 429 | 326 | 103 | 6.4 | 1.5 | 7.9 | 7.1 | 0.8 | Playoffs | No Superbowl |
| New York | Jets | 2000 | 9 | 7 | 321 | 321 | 0 | 0.0 | 3.5 | 3.5 | 1.4 | 2.2 | No Playoffs | No Superbowl |
| Buffalo | Bills | 2000 | 8 | 8 | 315 | 350 | -35 | -2.2 | 2.2 | 0.0 | 0.5 | -0.5 | No Playoffs | No Superbowl |
| New England | Patriots | 2000 | 5 | 11 | 276 | 338 | -62 | -3.9 | 1.4 | -2.5 | -2.7 | 0.2 | No Playoffs | No Superbowl |
| Tennessee | Titans | 2000 | 13 | 3 | 346 | 191 | 155 | 9.7 | -1.3 | 8.3 | 1.5 | 6.8 | Playoffs | No Superbowl |
| Name | Piped data |
| Number of rows | 638 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team | 0 | 1 | 5 | 13 | 0 | 32 | 0 |
| team_name | 0 | 1 | 4 | 10 | 0 | 32 | 0 |
| playoffs | 0 | 1 | 8 | 11 | 0 | 2 | 0 |
| sb_winner | 0 | 1 | 12 | 13 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2009.53 | 5.76 | 2000.0 | 2005.00 | 2010.0 | 2014.75 | 2019.0 | ▇▇▇▇▇ |
| wins | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
| loss | 0 | 1 | 7.98 | 3.08 | 0.0 | 6.00 | 8.0 | 10.00 | 16.0 | ▂▆▇▆▂ |
| points_for | 0 | 1 | 350.28 | 71.40 | 161.0 | 299.00 | 348.0 | 396.00 | 606.0 | ▂▇▇▂▁ |
| points_against | 0 | 1 | 350.28 | 59.55 | 165.0 | 310.00 | 347.0 | 391.50 | 517.0 | ▁▃▇▆▁ |
| points_differential | 0 | 1 | 0.00 | 101.09 | -261.0 | -75.00 | 1.5 | 72.75 | 315.0 | ▂▆▇▅▁ |
| margin_of_victory | 0 | 1 | 0.00 | 6.32 | -16.3 | -4.70 | 0.1 | 4.57 | 19.7 | ▂▆▇▅▁ |
| strength_of_schedule | 0 | 1 | 0.00 | 1.63 | -4.6 | -1.10 | 0.0 | 1.20 | 4.3 | ▁▅▇▅▁ |
| simple_rating | 0 | 1 | 0.00 | 6.20 | -17.4 | -4.47 | 0.0 | 4.50 | 20.1 | ▁▆▇▅▁ |
| offensive_ranking | 0 | 1 | 0.00 | 4.34 | -11.7 | -3.18 | 0.0 | 2.70 | 15.9 | ▁▇▇▂▁ |
| defensive_ranking | 0 | 1 | 0.00 | 3.57 | -9.8 | -2.40 | 0.1 | 2.50 | 9.8 | ▁▅▇▅▁ |
So, look at this snapshot of the data, there seems to be all kinds of interesting statistics between attendance and what appears to be by team year statistics.
| year | week | home_team | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 1 | Minnesota Vikings | Chicago Bears | Minnesota Vikings | NA | Sun | September 3 | 13:00:00 | 30 | 27 | 374 | 1 | 425 | 1 | Vikings | Minnesota | Bears | Chicago |
| 2000 | 1 | Kansas City Chiefs | Indianapolis Colts | Indianapolis Colts | NA | Sun | September 3 | 13:00:00 | 27 | 14 | 386 | 2 | 280 | 1 | Chiefs | Kansas City | Colts | Indianapolis |
| 2000 | 1 | Washington Redskins | Carolina Panthers | Washington Redskins | NA | Sun | September 3 | 13:01:00 | 20 | 17 | 396 | 0 | 236 | 1 | Redskins | Washington | Panthers | Carolina |
| 2000 | 1 | Atlanta Falcons | San Francisco 49ers | Atlanta Falcons | NA | Sun | September 3 | 13:02:00 | 36 | 28 | 359 | 1 | 339 | 1 | Falcons | Atlanta | 49ers | San Francisco |
| 2000 | 1 | Pittsburgh Steelers | Baltimore Ravens | Baltimore Ravens | NA | Sun | September 3 | 13:02:00 | 16 | 0 | 336 | 0 | 223 | 1 | Steelers | Pittsburgh | Ravens | Baltimore |
| 2000 | 1 | Cleveland Browns | Jacksonville Jaguars | Jacksonville Jaguars | NA | Sun | September 3 | 13:02:00 | 27 | 7 | 398 | 0 | 249 | 1 | Browns | Cleveland | Jaguars | Jacksonville |
From this vantage, it looks like I could connect the team name from the attendance data to the home team of the games data. To do so, I’ll have to make a key that matches. Most readily, it looks like I’ll need to combine a few the ‘team’ and ‘team_name’ columns in the attendance data.
att_reshape <- attendance %>%
mutate(t_name = str_c(team, team_name, sep = " ")) %>%
select(-team, -team_name)
# to merge these datasets, I needed 'week' to be numeric on both...
# and for whatever reason it was a character field to begin with...
games <- games %>%
mutate(week = as.numeric(week))
att_games <- left_join(att_reshape, games,
by = c('t_name' = 'home_team', 'year' = 'year', 'week' = 'week'))
kable(head(att_games))| year | total | home | away | week | weekly_attendance | t_name | away_team | winner | tie | day | date | time | pts_win | pts_loss | yds_win | turnovers_win | yds_loss | turnovers_loss | home_team_name | home_team_city | away_team_name | away_team_city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000 | 893926 | 387475 | 506451 | 1 | 77434 | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2000 | 893926 | 387475 | 506451 | 2 | 66009 | Arizona Cardinals | Dallas Cowboys | Arizona Cardinals | NA | Sun | September 10 | 20:35:00 | 32 | 31 | 322 | 1 | 330 | 2 | Cardinals | Arizona | Cowboys | Dallas |
| 2000 | 893926 | 387475 | 506451 | 3 | NA | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2000 | 893926 | 387475 | 506451 | 4 | 71801 | Arizona Cardinals | Green Bay Packers | Green Bay Packers | NA | Sun | September 24 | 16:06:00 | 29 | 3 | 455 | 1 | 209 | 4 | Cardinals | Arizona | Packers | Green Bay |
| 2000 | 893926 | 387475 | 506451 | 5 | 66985 | Arizona Cardinals | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 2000 | 893926 | 387475 | 506451 | 6 | 44296 | Arizona Cardinals | Cleveland Browns | Arizona Cardinals | NA | Sun | October 8 | 16:15:00 | 29 | 21 | 315 | 2 | 240 | 0 | Cardinals | Arizona | Browns | Cleveland |
At first glance it looks like there is a bunch of missing data, but because I merged on the home team, almost all the missingness appears to be related to the team is away or if they had a by week. Which makes it less alarming… I was alarmed at first… “did my join work?”
Nevertheless, I think we’re almost at a point to start visualizing stuff. It would be helpful to have a column identifying if a team is home or away… just so I can sort on it later.
att_games_ha <- att_games %>%
mutate(away_team_ind = if_else(is.na(away_team) & !is.na(weekly_attendance), 1, 0),
away_team_ind = case_when(
is.na(weekly_attendance) ~ 9999,
TRUE ~ away_team_ind))So, by doing all of this, I can see trends across time of how well a team does in terms of attendance between home and away games.
Lets take a look.
chiefs <- att_games_ha %>%
filter(t_name == "Kansas City Chiefs")
chiefs_gg <- chiefs %>%
filter(away_team_ind != 9999) %>%
group_by(year, week, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>%
ggplot(aes(x = year, y = average_att)) +
geom_point() +
facet_wrap(~away_team_ind)
chiefs_ggWell, that’s interesting… but I wonder if the obvious dip in at home attendance between ~2007 and ~2013 is simply a localized artifact to the Chiefs.
all_gg <- att_games_ha %>%
filter(away_team_ind != 9999) %>%
group_by(t_name, year, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>% ungroup() %>%
ggplot(aes(x = year, y = average_att, color = as.factor(away_team_ind))) +
geom_point() +
facet_wrap(~t_name) +
theme(legend.position = "top")
all_ggSo obviously, each team has a fairly unique pattern of attendance, especially between home and away games. You can also see where some teams got a new stadium that pushed attendance up, such as the Dallas Cowboys.
gg1 <- att_games_ha %>%
filter(away_team_ind != 9999) %>%
group_by(t_name, year, week, away_team_ind) %>%
summarize(average_att = mean(weekly_attendance)) %>%
ggplot(aes(t_name, average_att, color = as.factor(away_team_ind))) +
geom_boxplot(outlier.alpha = .5) +
coord_flip() +
theme(legend.position = "top")
gg1Although both the previous graph do not exhibit any clear reoccuring or dicernable trends, it makes me wonder what the relationship between home attendance and season performance is. Said another way, does a strong home team crowd during the first half of the season, relate to eventual success?
Lets take a look.
standings <- standings %>%
mutate(t_name = str_c(team, team_name, sep = " "))
predict_1 <- att_games_ha %>%
filter(!is.na(away_team),
week < 8) %>%
group_by(year, t_name) %>%
summarize(avg.att = mean(weekly_attendance)) %>% ungroup() %>%
left_join(standings, by = c("year", "t_name")) %>%
mutate(perc.win = wins/(wins+loss))
cor(predict_1$avg.att, predict_1$perc.win) #this is not promissing, but lets visualize## [1] 0.09888602
So much variability. I would assume this is because each team has different stadium sizes, and therefore the y-axis is not relative across teams. Lets see if we can control this.
Let’s first create a function to normalize data within group (year, team)
Group it and then lets see how things relate.
# Full season results were used, although home and away games' attendance will count for more than one team at a time.
predict_2 <- att_games_ha %>%
# filter(!is.na(away_team)) %>%
filter(!is.na(weekly_attendance)) %>%
group_by(year, t_name) %>%
mutate(avg.att = zscore(weekly_attendance)) %>% ungroup() %>%
left_join(standings, by = c("year", "t_name")) %>%
mutate(perc.win = wins/(wins+loss)) %>%
# filter(week < 8) %>%
# filter(week > 4 & week < 10) %>% #trying out different perspectives
# filter(week > 8) %>%
group_by(year, t_name, perc.win) %>%
summarize(avg_first_half = mean(avg.att)) %>% ungroup()Lets take a look now.
predict_2_gg <- predict_2 %>%
ggplot(aes(x = perc.win, y = avg_first_half)) +
geom_point() +
geom_smooth(method = "auto")
# geom_point() +
# geom_smooth(stat = "smooth")
predict_2_ggAlthough first half season attendance does not seem to be extensively predictive, perhaps there are other indicators that could help improve this model. Lets look a bit deeper here.
## [1] -0.2012417
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.506 0.00757 66.8 8.91e-290
## 2 avg_first_half -0.136 0.0263 -5.18 2.97e- 7
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.0405 0.0390 0.189 26.8 2.97e-7 2 158. -311. -298.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
lm2 <- lm(data = predict_2, perc.win ~ year + avg_first_half + year*avg_first_half)
broom::tidy(lm2)## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.43e- 1 2.70e+ 0 0.312 0.755
## 2 year -1.70e- 4 1.34e- 3 -0.126 0.899
## 3 avg_first_half 1.15e+17 2.34e+17 0.491 0.624
## 4 year:avg_first_half -5.67e+13 1.16e+14 -0.487 0.626
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.00325 -0.00147 0.193 0.689 0.559 4 146. -283. -260.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
predict_3 <- predict_2 %>%
left_join(standings, by = c("year", "t_name"))
lm3 <- lm(data = predict_3, perc.win ~ year + avg_first_half + points_for)
broom::tidy(lm3)## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.60e+ 0 1.81e+ 0 4.75 2.55e- 6
## 2 year -4.39e- 3 9.04e- 4 -4.85 1.53e- 6
## 3 avg_first_half 4.38e+14 4.26e+14 1.03 3.04e- 1
## 4 points_for 2.03e- 3 7.29e- 5 27.9 2.68e-112
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.552 0.550 0.129 261. 3.77e-110 4 402. -793. -771.
## # … with 2 more variables: deviance <dbl>, df.residual <int>